NLP for Election Sentiment Analysis

Natural Language Processing Application for Political Sentiment Forecasting

Presented By: Dr. Ratnesh Prasad Srivastava, CSIT, GGV, C.G.

Social Media Data Collection

Collect and process social media data for political sentiment analysis using web scraping and API integration.

Data Collection Parameters
Data Statistics
Sample Size Calculation

For sentiment analysis, the required sample size can be calculated as:

\[ n = \frac{z^2 \times p(1-p)}{e^2} \]

Where:

  • \( n \) = required sample size
  • \( z \) = z-score (1.96 for 95% confidence level)
  • \( p \) = estimated proportion (0.5 for maximum variability)
  • \( e \) = margin of error (typically 0.03 for sentiment analysis)

For a 95% confidence level and 3% margin of error: \[ n = \frac{1.96^2 \times 0.5(1-0.5)}{0.03^2} \approx 1067 \]
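The calculation above can be verified with a short helper (a minimal sketch; the function name and defaults are illustrative):

```python
import math

def required_sample_size(z: float = 1.96, p: float = 0.5, e: float = 0.03) -> int:
    """Sample size for estimating a proportion: n = z^2 * p(1-p) / e^2."""
    n = (z ** 2) * p * (1 - p) / (e ** 2)
    return math.ceil(n)  # round up; a fractional respondent is not possible

print(required_sample_size())        # 1068 (the ~1067 figure above, rounded up)
print(required_sample_size(e=0.05))  # 385 for a wider 5% margin of error
```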

Data Quality Metrics

Data quality is assessed using:

\[ \text{Quality Score} = \frac{\text{Valid Records}}{\text{Total Records}} \times 100\% \]

Where valid records meet criteria for language, relevance, and completeness.
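As a sketch of how this score might be computed, the snippet below applies simplified validity checks; the record fields and criteria are illustrative assumptions, not the application's actual schema:

```python
def quality_score(records):
    """Percentage of records passing language, relevance, and completeness checks."""
    def is_valid(r):
        # Simplified criteria: non-empty text, expected language, flagged relevant
        return bool(r.get("text")) and r.get("language") == "en" and r.get("relevant", False)
    if not records:
        return 0.0
    valid = sum(1 for r in records if is_valid(r))
    return 100.0 * valid / len(records)

records = [
    {"text": "Great rally today", "language": "en", "relevant": True},
    {"text": "", "language": "en", "relevant": True},              # incomplete
    {"text": "Bon discours", "language": "fr", "relevant": True},  # wrong language
    {"text": "Policy debate tonight", "language": "en", "relevant": True},
]
print(quality_score(records))  # 50.0
```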

Text Preprocessing Pipeline

Clean and prepare text data for sentiment analysis using NLP techniques.

Raw Text

"Modi is the best PM! #VoteBJP 🇮🇳"

Cleaning

Remove URLs, emojis, special chars

Tokenization

["Modi", "is", "the", "best", "PM", "VoteBJP"]

Normalization

Lowercasing, spelling correction

POS Tagging

[("Modi", "NOUN"), ("best", "ADJ"), ...]

Feature Extraction

TF-IDF, word embeddings

Text Processing Options
Part-of-Speech Tags
Modi (NOUN) doing (VERB) great (ADJ) work (NOUN) India (NOUN) economy (NOUN) growing (VERB) fast (ADV)
Processing Results
Original Text

Narendra Modi is doing great work for India! The economy is growing fast. #DevelopedIndia

Processed Text

narendra modi great work india economy grow fast developedindia

Python Code Example
# Text preprocessing with NLTK and spaCy
# Requires: nltk.download('punkt'), nltk.download('stopwords')
import re
import nltk
import spacy

# Load the spaCy model once, not on every call
nlp = spacy.load('en_core_web_sm')

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove URLs, mentions, and hashtag symbols
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#', '', text)
    # Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # Remove stopwords
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    # Lemmatization
    doc = nlp(" ".join(tokens))
    tokens = [token.lemma_ for token in doc]
    return " ".join(tokens)

# Example usage
original_text = "Narendra Modi is doing great work for India! #DevelopedIndia"
processed_text = preprocess_text(original_text)
print(processed_text)
# Output: "narendra modi great work india developedindia"
Text Statistics

Text statistics help understand the complexity of the data:

\[ \text{Type-Token Ratio} = \frac{\text{Number of Unique Words}}{\text{Total Words}} \]

Higher ratios indicate more diverse vocabulary.
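The type-token ratio is simple enough to compute directly; this sketch uses naive whitespace tokenization for illustration:

```python
def type_token_ratio(text: str) -> float:
    """Unique words divided by total words (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "the economy is growing and the jobs are growing"
print(round(type_token_ratio(sample), 3))  # 0.778 (7 unique words out of 9)
```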

Sentiment Analysis and Prediction

Analyze political sentiment from text data and predict election outcomes.

Analysis Parameters
Model Performance


                  | Predicted Positive | Predicted Negative
Actual Positive   | TP: 342            | FN: 58
Actual Negative   | FP: 42             | TN: 358

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{342 + 358}{342 + 358 + 42 + 58} = 0.875 \]

\[ \text{Precision} = \frac{TP}{TP + FP} = \frac{342}{342 + 42} = 0.891 \]

\[ \text{Recall} = \frac{TP}{TP + FN} = \frac{342}{342 + 58} = 0.855 \]

\[ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = 0.872 \]
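The four metrics above follow directly from the confusion-matrix counts, as this short helper shows:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=342, tn=358, fp=42, fn=58)
print(f"Accuracy={acc:.3f} Precision={prec:.3f} Recall={rec:.3f} F1={f1:.3f}")
# Accuracy=0.875 Precision=0.891 Recall=0.855 F1=0.872
```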

Sentiment Distribution
Positive: 45%
Neutral: 30%
Negative: 25%
Prediction Summary
BJP: 42%
Congress: 28%
AAP: 12%
Others: 18%

Predicted Seats: NDA: 295 | UPA: 145 | Others: 103

Sentiment-to-Seat Model: \[ \text{Seats} = \beta_0 + \beta_1 \times \text{Sentiment\%} + \beta_2 \times \text{Regional Weight} \]

Topic Modeling
Word Cloud Visualization: Economy, Development, Jobs, Corruption, Leadership, Price, Farmers, Education

Latent Dirichlet Allocation (LDA) for topic modeling:

\[ P(\text{word} | \text{topic}) = \frac{\text{Count(word in topic)} + \beta}{\text{Count(all words in topic)} + V\beta} \]

Where \( V \) is the vocabulary size and \( \beta \) is the Dirichlet prior.
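The smoothed word-topic probability above is a one-line computation; the counts and prior below are illustrative:

```python
def word_topic_prob(count_word_in_topic: int, count_all_words_in_topic: int,
                    vocab_size: int, beta: float = 0.01) -> float:
    """Smoothed P(word | topic) with Dirichlet prior beta, as defined above."""
    return (count_word_in_topic + beta) / (count_all_words_in_topic + vocab_size * beta)

# A word seen 5 times in a topic of 100 tokens, with a 1,000-word vocabulary
print(round(word_topic_prob(5, 100, 1000), 4))  # 0.0455
```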

Technical Details: NLP Methodologies

Comprehensive overview of natural language processing techniques for political sentiment analysis.

Natural Language Processing Fundamentals

Core concepts and techniques for processing and analyzing text data.

Text Representation Techniques

Text must be converted to numerical formats for machine learning:

Bag of Words (BoW)

\[ \text{Document} \rightarrow (c_1, c_2, \ldots, c_n), \quad c_i = \text{count of word } w_i \]

Simple frequency-based representation

TF-IDF

\[ \text{TF-IDF}(t,d) = \text{TF}(t,d) \times \log\left(\frac{N}{\text{DF}(t)}\right) \]

Weights words by importance

Word Embeddings

\[ \text{word} \rightarrow \text{dense vector} \]

Captures semantic relationships

Key NLP Tasks
Task | Description | Application in Election Analysis
Tokenization | Splitting text into words or subwords | Basic text preprocessing
Part-of-Speech Tagging | Identifying grammatical categories | Focus on adjectives for sentiment
Named Entity Recognition | Identifying people, organizations, locations | Tracking mentions of politicians and parties
Sentiment Analysis | Determining emotional tone | Measuring public opinion
Topic Modeling | Discovering abstract topics | Identifying key election issues
Sentiment Analysis Approaches
Sentiment Analysis Pipeline
1. Data Collection → Social media, news, forums
2. Text Preprocessing → Cleaning, normalization
3. Feature Extraction → TF-IDF, embeddings
4. Sentiment Classification → ML models, lexicons
5. Aggregation & Analysis → Trends, predictions
Mathematical Foundations

TF-IDF Calculation:

\[ \text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]

\[ \text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right) \]

\[ \text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t) \]
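The three formulas above can be implemented from scratch in a few lines; the toy corpus below is invented for illustration:

```python
import math
from collections import Counter

def tf(term: str, doc: list) -> float:
    """Term frequency: occurrences of the term over total terms in the document."""
    return Counter(doc)[term] / len(doc)

def idf(term: str, corpus: list) -> float:
    """Inverse document frequency: log of total docs over docs containing the term."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term: str, doc: list, corpus: list) -> float:
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["economy", "growth", "jobs"],
    ["economy", "policy"],
    ["jobs", "farmers", "education"],
]
# "economy" is 1 of 3 terms in the first document and appears in 2 of 3 documents
print(round(tf_idf("economy", corpus[0], corpus), 4))  # 0.1352
```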

Cosine Similarity:

\[ \text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} \]

Used to measure similarity between documents or between queries and documents.
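The cosine similarity formula maps directly to code; the two count vectors below are hypothetical bag-of-words representations:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical bag-of-words count vectors for two short documents
doc_a = [1, 1, 0, 2]
doc_b = [1, 0, 1, 1]
print(round(cosine_similarity(doc_a, doc_b), 3))  # 0.707
```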

Model Architectures for Sentiment Analysis

Various machine learning and deep learning approaches for analyzing political sentiment.

Traditional Machine Learning Models
Naive Bayes

\[ P(\text{class}| \text{features}) \propto P(\text{class}) \prod P(\text{feature}|\text{class}) \]

Fast and works well with small datasets

Logistic Regression

\[ P(y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \cdots + \beta_nx_n)}} \]

Provides probability estimates

Support Vector Machines

\[ \min_{w,b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \]

Effective in high-dimensional spaces
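As an illustration of the Naive Bayes formulation above, here is a minimal from-scratch classifier with Laplace smoothing; the four training posts and their labels are invented toy data:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count words per class for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)  # class -> word frequencies
    class_counts = Counter(labels)
    vocab = set()
    for doc, label in zip(docs, labels):
        for word in doc.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, class_counts, vocab

def predict_nb(doc, word_counts, class_counts, vocab):
    """argmax over classes of log P(class) + sum of log P(word | class)."""
    n_docs = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c in class_counts:
        score = math.log(class_counts[c] / n_docs)  # log prior
        total = sum(word_counts[c].values())
        for word in doc.split():
            # Laplace smoothing avoids zero probabilities for unseen words
            score += math.log((word_counts[c][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

docs = ["great development strong growth", "corruption failure weak economy",
        "strong leadership great progress", "failure corruption scandal"]
labels = ["pos", "neg", "pos", "neg"]
model = train_nb(docs, labels)
print(predict_nb("great strong progress", *model))  # pos
```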

Deep Learning Models
LSTM Networks

\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \]

Captures long-term dependencies in text

Transformer Models

\[ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

State-of-the-art for NLP tasks

BERT

Bidirectional encoder representations

Pre-trained on large text corpora
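The scaled dot-product attention equation shown for transformers can be sketched in a few lines of NumPy; the shapes and random inputs are illustrative only:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, as in the transformer attention formula."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))  # 3 query positions, d_k = 4
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one attended vector per query position
```

Each output row is a convex combination of the rows of V, weighted by how strongly the query attends to each key.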

Model Comparison
Model | Accuracy | Training Time | Interpretability | Best Use Case
Naive Bayes | 75-80% | Fast | High | Baseline, small datasets
Logistic Regression | 80-85% | Fast | High | Interpretable predictions
SVM | 82-87% | Medium | Medium | High-dimensional data
LSTM | 85-90% | Slow | Low | Sequential text data
BERT | 90-95% | Very slow | Low | State-of-the-art performance
BERT Architecture Details

BERT uses transformer architecture with multi-head attention:

\[ \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \]

\[ \text{where head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \]

Pre-training objectives:

1. Masked Language Model (MLM): Randomly mask tokens and predict them

\[ \text{Loss} = -\sum_{i=1}^{N} \log P(\text{masked}_i | \text{context}) \]

2. Next Sentence Prediction (NSP): Predict if sentence B follows sentence A

\[ \text{Loss} = -\log P(\text{isNext} | \text{sentence}_A, \text{sentence}_B) \]

Model Evaluation Metrics

Methods for assessing the performance of sentiment analysis models.

Classification Metrics

Accuracy: \[ \frac{TP + TN}{TP + TN + FP + FN} \]

Precision: \[ \frac{TP}{TP + FP} \]

Recall: \[ \frac{TP}{TP + FN} \]

F1-Score: \[ 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

Where TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives

Confusion Matrix
                  | Predicted Positive | Predicted Negative
Actual Positive   | TP: 342            | FN: 58
Actual Negative   | FP: 42             | TN: 358
Cross-Validation

K-fold cross-validation provides robust performance estimation:

\[ CV(k) = \frac{1}{k} \sum_{i=1}^{k} \text{Accuracy}_i \]

Where k is the number of folds (typically 5 or 10)
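A minimal sketch of how the folds and the averaged score are produced (pure Python; no ML library assumed):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cv_score(fold_accuracies):
    """CV(k): the mean of the per-fold accuracies."""
    return sum(fold_accuracies) / len(fold_accuracies)

folds = list(kfold_indices(n=10, k=5))
print([len(test) for _, test in folds])          # [2, 2, 2, 2, 2]
print(cv_score([0.86, 0.88, 0.84, 0.87, 0.85]))  # mean accuracy across 5 folds
```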

ROC Curve and AUC

Receiver Operating Characteristic curve plots True Positive Rate vs False Positive Rate:

\[ \text{TPR} = \frac{TP}{TP + FN} \]

\[ \text{FPR} = \frac{FP}{FP + TN} \]

Area Under Curve (AUC) provides aggregate performance measure across classification thresholds.
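Each point on the ROC curve corresponds to one threshold; the snippet below computes (FPR, TPR) pairs from hypothetical labels and scores:

```python
def roc_point(y_true, y_score, threshold):
    """(FPR, TPR) for a single classification threshold."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)

y_true  = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]  # hypothetical model scores
for t in (0.25, 0.5, 0.75):
    fpr, tpr = roc_point(y_true, y_score, t)
    print(f"threshold={t}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```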

Deployment and Production

Strategies for deploying NLP models in production environments.

Deployment Architecture
Production Deployment Pipeline
1. Data Ingestion → Social media APIs, web scraping
2. Preprocessing → Text cleaning, normalization
3. Model Serving → REST API, TensorFlow Serving
4. Storage → Databases, data lakes
5. Visualization → Dashboards, reports
Model Serving Options
Approach | Pros | Cons | Best For
REST API | Simple, language-agnostic | Higher latency | Small to medium workloads
TensorFlow Serving | High performance, versioning | Complex setup | TensorFlow models
ONNX Runtime | Framework-agnostic | Conversion overhead | Multi-framework environments
Edge Deployment | Low latency, offline capability | Limited resources | Mobile applications
Monitoring and Maintenance

Critical aspects of production NLP systems:

  • Performance Monitoring: Track accuracy, latency, throughput
  • Data Drift Detection: Monitor changes in input data distribution
  • Concept Drift Detection: Identify changes in relationships between inputs and outputs
  • Model Retraining: Periodic updates with new data
  • A/B Testing: Compare model versions
Scalability Considerations

For high-throughput systems, consider:

\[ \text{Throughput} = \frac{\text{Number of requests}}{\text{Time}} \]

\[ \text{Average Latency} = \frac{\text{Total processing time}}{\text{Number of requests}} \]

Horizontal scaling can improve throughput:

\[ \text{Max throughput} = \text{Instances} \times \text{Throughput per instance} \]

Explanation: Understanding NLP for Election Sentiment Analysis

This section provides a comprehensive explanation of the methodologies, interpretations, and applications of NLP in political sentiment analysis.

Project Overview

This application demonstrates how Natural Language Processing (NLP) techniques can be used to analyze public sentiment toward political parties and predict election outcomes based on social media data.

Why Social Media Data?

Social media platforms have become the modern public square where political opinions are freely expressed. Analyzing this data provides:

  • Real-time insights into public opinion
  • Large volume of diverse perspectives
  • Geographic and demographic distribution
  • Unfiltered expressions of sentiment
Methodology Overview

The process involves several key steps:

  1. Data Collection: Gathering social media posts related to political entities
  2. Text Preprocessing: Cleaning and preparing text for analysis
  3. Feature Extraction: Converting text to numerical representations
  4. Sentiment Classification: Determining positive, negative, or neutral sentiment
  5. Aggregation & Prediction: Combining results to forecast election outcomes

How to Interpret the Results

Understanding the output of the sentiment analysis is crucial for drawing meaningful conclusions.

Sentiment Scores

Sentiment analysis models typically output a score between -1 (most negative) and +1 (most positive). In this application:

  • Positive Sentiment (0.05 to 1): Indicates support, approval, or favorable opinion
  • Neutral Sentiment (-0.05 to 0.05): Indicates factual statements or mixed opinions
  • Negative Sentiment (-1 to -0.05): Indicates criticism, disapproval, or negative opinion

Example: "Modi is doing great work for India's development" → Positive sentiment (score ~0.7)

Example: "The government failed to control inflation" → Negative sentiment (score ~-0.6)

Example: "Elections will be held in April" → Neutral sentiment (score ~0.0)

From Sentiment to Seat Prediction

Converting sentiment percentages to seat predictions involves a statistical model that considers:

\[ \text{Predicted Seats} = \beta_0 + \beta_1 \times \text{Sentiment\%} + \beta_2 \times \text{Regional Weight} + \beta_3 \times \text{Historical Performance} \]

Where:

  • β₀ is the baseline intercept
  • β₁ represents the impact of sentiment on seat share
  • β₂ accounts for regional variations in sentiment impact
  • β₃ incorporates historical voting patterns
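The regression above can be sketched as a simple linear function; all coefficient values below are hypothetical placeholders, not fitted estimates:

```python
def predicted_seats(sentiment_pct, regional_weight, historical_seats,
                    b0=10.0, b1=4.5, b2=20.0, b3=0.5):
    """Linear sentiment-to-seat model. In practice the coefficients are
    estimated by fitting the regression to past election results."""
    return b0 + b1 * sentiment_pct + b2 * regional_weight + b3 * historical_seats

# Hypothetical inputs: 42% positive sentiment, regional weight 1.2, 300 prior seats
print(predicted_seats(42, 1.2, 300))
```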

Technical Implementation Details

This application employs multiple NLP techniques to ensure accurate sentiment analysis.

Handling Multilingual Content

Indian political discourse occurs in multiple languages. Our approach includes:

  • Language detection and separate processing pipelines
  • Language-specific sentiment lexicons
  • Translation of regional language content to English for model consistency
  • Cultural context consideration in sentiment interpretation
Addressing Sarcasm and Context

Political discourse often contains sarcasm and irony, which challenge sentiment analysis:

  • Contextual embedding models (BERT) capture subtle linguistic cues
  • Rule-based patterns identify common sarcastic constructions
  • Emoji and punctuation analysis provides additional sentiment signals

Limitations and Considerations

While powerful, NLP-based sentiment analysis has important limitations to consider:

Representativeness Bias

Social media users are not perfectly representative of the entire electorate. Younger, urban, and tech-savvy individuals may be overrepresented.

Algorithmic Bias

Machine learning models can inherit biases from training data. We mitigate this through:

  • Diverse training datasets across regions and demographics
  • Regular bias testing and model adjustment
  • Ensemble approaches combining multiple models
Temporal Dynamics

Political sentiment can change rapidly due to events, news cycles, and campaigns. Our models incorporate:

  • Time-weighting of recent data
  • Event detection and sentiment impact assessment
  • Trend analysis rather than point-in-time snapshots

Ethical Considerations

When conducting political sentiment analysis, we adhere to strict ethical guidelines:

Privacy Protection

All analysis is performed on aggregated, anonymized data. We never:

  • Identify or store personal information of individual users
  • Attribute sentiments to specific individuals
  • Use data for anything beyond statistical analysis
Transparency

We believe in transparent methodology including:

  • Clear documentation of data sources and processing methods
  • Openness about limitations and potential biases
  • Explanation of how predictions are generated

Future Directions

We are continuously working to improve our models through:

Advanced Techniques
  • Multimodal analysis combining text, image, and video content
  • Graph analysis of information spread and influence networks
  • Real-time sentiment tracking during key political events
  • Cross-platform sentiment integration
Application Expansion
  • Sentiment analysis for specific policy issues
  • Regional and demographic breakdowns
  • Longitudinal studies of sentiment evolution
  • Integration with traditional polling data

ROC (Receiver Operating Characteristic) Curve

The ROC curve is a fundamental tool for evaluating the performance of classification models, including sentiment analysis systems.

What is an ROC Curve?

An ROC curve is a graphical representation of a classification model's performance across all classification thresholds. It plots two parameters:

  • True Positive Rate (TPR) also known as Sensitivity or Recall: \[ TPR = \frac{TP}{TP + FN} \]
  • False Positive Rate (FPR) also known as Fall-out: \[ FPR = \frac{FP}{FP + TN} \]
ROC Curve Example
Interpreting ROC Curves

The ROC curve provides valuable insights into model performance:

  • Top-left corner: Ideal point with FPR=0 and TPR=1, representing perfect classification
  • Diagonal line: Represents random guessing (AUC = 0.5)
  • Above the diagonal: Indicates model performance better than random chance
  • Below the diagonal: Suggests performance worse than random chance (could be inverted)

Example Interpretation: A model with ROC curve that approaches the top-left corner indicates high true positive rates while maintaining low false positive rates across thresholds.

ROC Curve in Sentiment Analysis

In sentiment analysis applications, ROC curves help:

  • Compare different classification algorithms
  • Determine optimal threshold for positive/negative classification
  • Visualize trade-offs between true positives and false positives
  • Evaluate model performance across different sentiment classes

For multi-class sentiment analysis (positive, negative, neutral), ROC curves can be created for each class using a one-vs-rest approach.

Practical Considerations

When using ROC curves for sentiment analysis evaluation:

  • ROC curves are particularly useful when class distributions are imbalanced
  • They provide a visual representation of classification trade-offs
  • ROC analysis helps select appropriate operating points based on application requirements
  • For sentiment analysis, the cost of false positives (misclassifying negative as positive) may differ from false negatives

AUC (Area Under the ROC Curve)

The Area Under the ROC Curve (AUC) provides a single-number summary of classifier performance across all possible classification thresholds.

What Does AUC Measure?

AUC represents the probability that a randomly chosen positive instance will be ranked higher than a randomly chosen negative instance. It provides an aggregate measure of performance across all classification thresholds.

The AUC value ranges from 0 to 1, where:

  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: No discriminative power (equivalent to random guessing)
  • AUC < 0.5: Worse than random guessing (may indicate reversed predictions)
  • AUC > 0.9: Excellent classifier
  • AUC > 0.8: Good classifier
  • AUC > 0.7: Fair classifier
Interpreting AUC Values

AUC provides a robust measure of classifier performance that is insensitive to class distribution and classification threshold:

AUC Interpretation Guide

Example: An AUC of 0.85 means there's an 85% chance that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
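This ranking interpretation can be computed directly by comparing every positive-negative pair of scores; the labels and scores below are hypothetical:

```python
def auc_score(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly
    (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true  = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
print(round(auc_score(y_true, y_score), 3))  # 0.889: 8 of 9 pairs ranked correctly
```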

AUC Advantages for Sentiment Analysis

AUC is particularly valuable for evaluating sentiment analysis models because:

  • It's threshold-invariant, measuring quality of predictions irrespective of cutoff choice
  • It's scale-invariant, measuring how well predictions are ranked rather than absolute values
  • It handles class imbalance well, which is common in sentiment data
  • It provides a single metric for comparing different models
Limitations of AUC

While AUC is a valuable metric, it has some limitations:

  • It doesn't provide information about actual error rates or costs
  • It can be overly optimistic for imbalanced datasets in certain cases
  • It doesn't indicate optimal operating point for specific applications
  • It may mask poor performance in specific regions of the ROC space

For comprehensive model evaluation, AUC should be used alongside other metrics like precision, recall, and F1-score, especially when the costs of different types of errors vary.

Bag of Words (BoW) Model

The Bag of Words model is a fundamental text representation technique in natural language processing that simplifies text data for machine learning algorithms.

What is the Bag of Words Model?

The Bag of Words model represents text as a "bag" (multiset) of its words, disregarding grammar and word order but keeping track of word frequency. It creates a vocabulary of all unique words in the corpus and represents each document as a vector of word counts.

Example:

Document 1: "The party leader gave a strong speech"

Document 2: "The speech was strong and powerful"

Vocabulary: ["the", "party", "leader", "gave", "a", "strong", "speech", "was", "and", "powerful"]

BoW vectors:

Document 1: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

Document 2: [1, 0, 0, 0, 0, 1, 1, 1, 1, 1]

Mathematical Representation

The Bag of Words model can be represented mathematically as:

Given a vocabulary \( V = \{w_1, w_2, \ldots, w_n\} \) of size \( n \),

Each document \( d \) is represented as a vector: \( \vec{d} = (c_1, c_2, \ldots, c_n) \)

Where \( c_i \) is the count of word \( w_i \) in document \( d \).

This representation creates a document-term matrix where rows correspond to documents and columns correspond to terms in the vocabulary.

Applications in Sentiment Analysis

The Bag of Words model is widely used in sentiment analysis because:

  • It provides a simple way to convert unstructured text into structured data
  • It captures word frequency information that often correlates with sentiment
  • It works well with traditional machine learning algorithms
  • It can be enhanced with techniques like TF-IDF weighting

Example: In political sentiment analysis, words like "development", "progress", and "strong" might frequently appear in positive sentiment documents, while words like "corruption", "failure", and "weak" might appear in negative sentiment documents.

Limitations and Enhancements

While simple and effective, the Bag of Words model has several limitations:

  • Loss of word order: "not good" and "good not" are represented identically
  • Loss of semantic meaning: Doesn't capture relationships between words
  • Vocabulary size: Can create very high-dimensional sparse vectors
  • Ignore context: Doesn't consider the context in which words appear

Common enhancements to address these limitations include:

  • N-grams (capturing word sequences)
  • TF-IDF weighting (reducing importance of common words)
  • Stop word removal
  • Stemming and lemmatization
Python Implementation Example
# Bag of Words implementation using sklearn
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The government announced new development policies",
    "Corruption allegations against the minister",
    "Economic growth shows positive trends"
]

# Create CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Get feature names (sorted alphabetically by CountVectorizer)
feature_names = vectorizer.get_feature_names_out()

# Convert to array and display
print("Vocabulary:", feature_names)
print("Document-term matrix:")
print(X.toarray())

# Output:
# Vocabulary: ['against' 'allegations' 'announced' 'corruption' 'development'
#  'economic' 'government' 'growth' 'minister' 'new' 'policies' 'positive'
#  'shows' 'the' 'trends']
# Document-term matrix:
# [[0 0 1 0 1 0 1 0 0 1 1 0 0 1 0]
#  [1 1 0 1 0 0 0 0 1 0 0 0 0 1 0]
#  [0 0 0 0 0 1 0 1 0 0 0 1 1 0 1]]

Lemmatization in NLP

Lemmatization is a text normalization technique in natural language processing that reduces words to their base or dictionary form, known as the lemma.

What is Lemmatization?

Lemmatization uses vocabulary and morphological analysis to remove inflectional endings and return the base or dictionary form of a word. Unlike stemming, which uses heuristic rules, lemmatization considers the context and part of speech to determine the lemma.

Examples:

  • running → run
  • better → good
  • went → go
  • are → be
  • mice → mouse
  • policies → policy

Formally, lemmatization can be defined as a function:

\[ \text{lemma}(w) = \arg\max_{l \in L} \text{similarity}(w, l) \]

Where \( L \) is the set of all possible lemmas and similarity is determined through linguistic rules and dictionary lookups.

Why Use Lemmatization?

Lemmatization provides several benefits in text processing and NLP applications:

  • Reduces dimensionality: Groups different forms of the same word
  • Improves feature consistency: Ensures same lemma is represented consistently
  • Preserves meaning: Maintains semantic relationships between words
  • Enhances model performance: Helps machine learning models generalize better

Example in Sentiment Analysis:

Without lemmatization: "The government is improving, improvements continue, improved results"

With lemmatization: "the government be improve, improvement continue, improve result"

This lets the model recognize the verb forms "improving" and "improved" as the single lemma "improve" (the noun "improvements" maps to its own lemma, "improvement").

Lemmatization vs. Stemming

While both techniques reduce words to their base forms, they differ in important ways:

Aspect | Stemming | Lemmatization
Approach | Rule-based, heuristic | Dictionary-based, morphological analysis
Output | Word stem (may not be a valid word) | Lemma (always a valid word)
Context awareness | No | Yes (considers part of speech)
Accuracy | Lower | Higher
Computational cost | Lower | Higher
Example | "running" → "run", "better" → "better" | "running" → "run", "better" → "good"
Implementation in Python

Popular NLP libraries provide lemmatization capabilities:

# Lemmatization using NLTK
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download required resources (first time only)
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# nltk.download('averaged_perceptron_tagger')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Simple lemmatization
print(lemmatizer.lemmatize("running"))           # Output: running (defaults to noun)
print(lemmatizer.lemmatize("running", pos='v'))  # Output: run

# Lemmatization with POS tagging
def get_wordnet_pos(treebank_tag):
    """Convert treebank POS tag to wordnet POS tag"""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun

# Lemmatize with POS context
sentence = "The government's policies are improving economic conditions"
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)

lemmatized = []
for word, tag in pos_tags:
    wn_tag = get_wordnet_pos(tag)
    lemma = lemmatizer.lemmatize(word, pos=wn_tag)
    lemmatized.append(lemma)

print("Original:", sentence)
print("Lemmatized:", " ".join(lemmatized))
# Output: "The government 's policy be improve economic condition"
# Lemmatization using spaCy
import spacy

# Load English model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("The governments are implementing better policies for development")

# Extract lemmas
lemmas = [token.lemma_ for token in doc]

print("Original: The governments are implementing better policies for development")
print("Lemmatized:", " ".join(lemmas))
# Output: "the government be implement good policy for development"
Considerations for Sentiment Analysis

When applying lemmatization to sentiment analysis tasks:

  • Lemmatization can help group sentiment-bearing words with different inflections
  • It may sometimes obscure nuances (e.g., "better" → "good" loses comparative meaning)
  • For some applications, preserving certain inflections might be important for sentiment
  • It's essential to evaluate whether lemmatization improves model performance for your specific task

Political Sentiment Example:

Original: "The candidate's promises are convincing, and voters were convinced by the arguments"

Lemmatized: "The candidate's promise be convince, and voter be convince by the argument"

This normalization helps the model recognize the consistent sentiment across different forms of "convince".